docs: add operations documentation guides by WentingWu666666 · Pull Request #309 · documentdb/documentdb-kubernetes-operator

WentingWu666666 · 2026-03-12T20:31:21Z

This PR adds 5 operations documentation guides for the DocumentDB Kubernetes Operator, covering day-to-day cluster management tasks.

New Documentation

| Guide | Description |
|-------|-------------|
| Failover | Automatic local replica promotion, cross-cluster failover for multi-region, and application connection considerations |
| Upgrades | Operator Helm chart and CRD upgrades, per-cluster DocumentDB extension and gateway image updates, and rollback procedures |
| Backup & Restore | On-demand and scheduled VolumeSnapshot backups, restore from backup, and retention policy configuration |
| Restore Deleted Cluster | Recovery via VolumeSnapshot backup restore or retained PersistentVolume reattachment |
| Maintenance | Cluster health monitoring, PostgreSQL and gateway log review, resource usage tracking, and Kubernetes events/alerts |

Verification

Every command, event name, label, container name, and path in these docs was verified against:

Source code audit of the operator controllers
Live testing in a local Kind cluster (Kubernetes v1.35.0)

Key decisions

Scaling doc moved to a separate branch (wentingwu/scaling-docs), blocked on issue #306 (reconciliation loop does not propagate spec changes to existing CNPG clusters).
CRD upgrade step uses --server-side --force-conflicts: plain kubectl apply fails for the large dbs.documentdb.io CRD.
CRD upgrade placed before helm upgrade: this ensures new CRD fields are available when the operator starts.
No spec.resources references: the DocumentDB CRD only has spec.resource.storage, not CPU/memory limits.
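
The ordering described above (CRDs first with server-side apply, then the Helm chart) can be sketched roughly as follows. This is an illustration, not the procedure from the guide itself: the `run` dry-run helper, the `upgrade_operator` function, and the local CRD directory are hypothetical, and the `documentdb-operator` namespace is assumed from the troubleshooting fix mentioned in the commits below.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; run it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# upgrade_operator <crd-dir> <namespace>
# <crd-dir> is a hypothetical local directory holding the CRD manifests
# for the target release; it is not a path taken from the PR.
upgrade_operator() {
  local crd_dir="$1" namespace="$2"

  # 1. CRDs first, server-side, so the large dbs.documentdb.io CRD is not
  #    rejected by client-side apply.
  run kubectl apply --server-side --force-conflicts -f "$crd_dir"

  # 2. Then the operator chart, which can rely on the new CRD fields.
  run helm upgrade documentdb-operator documentdb/documentdb-operator \
    --namespace "$namespace"
}
```

Calling `upgrade_operator ./crds documentdb-operator` prints the two commands in order; setting `DRY_RUN=0` would execute them against the current kubectl context.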

Also includes

Updated CONTRIBUTING.md with MkDocs documentation testing instructions
Updated mkdocs.yml navigation (removed scaling.md)

Closes #253

Add six new operations guides covering day-2 cluster management:

- backup-and-restore: conceptual overview, on-demand/scheduled backups, restore workflow, retention policy, and troubleshooting
- scaling: vertical scaling (instancesPerNode 1-3) and PVC storage expansion with prerequisites and monitoring
- upgrades: operator, extension, and gateway upgrade procedures, rolling update behavior, and rollback protection
- failover: local automatic and cross-cluster manual failover, testing procedures, and application connection considerations
- restore-deleted-cluster: recovery from backup or retained PV, verification steps, and common pitfalls
- maintenance: monitoring, log management, resource tuning, node maintenance, rolling restarts, and routine checklists

Update mkdocs.yml with new Operations navigation section.

Refs documentdb#253

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Add YAML front matter (title, description, tags) to all 6 operations docs
- Rewrite Overview sections: what the operation is + why it matters
- Disambiguate all bare 'cluster' to 'DocumentDB cluster' or 'Kubernetes cluster'
- Disambiguate 'operator' to 'DocumentDB operator' in upgrades doc
- backup-and-restore: add CSI link, multi-region section, tabbed prerequisites, YAML block titles, restore constraints, cross-ref to networking for mongosh
- restore-deleted-cluster: route Method 1 to backup-and-restore, remove internal details section, add YAML title, cross-ref to networking for verify step
- scaling: replace unsupported storage expansion with link to storage config
- upgrades: remove unnecessary backup step from operator upgrade, replace heredoc with YAML block, use placeholder versions instead of fake 1.2.0
- maintenance: fix broken link to removed storage-expansion anchor
- Update configuration front matter descriptions to match actual content

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Replace "CNPG monitors/promotes/triggers" with "the operator monitors/promotes/triggers" in prose explanations across all operations docs (failover, maintenance, scaling, upgrades, backup-and-restore). Resource names like clusters.postgresql.cnpg.io are preserved in kubectl commands that users need to run. Also restructures several sections into Material for MkDocs tabbed format for improved readability and fixes the troubleshooting namespace reference from cnpg-system to documentdb-operator. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Refine all operations documentation based on review feedback:

- scaling.md: mirror structure for scale up/down tabs, fix "at least 2" for failover, remove unnecessary checklist
- failover.md: fix networking cross-reference anchor, remove false connection pooling/quorum claims, fix replica read claim
- upgrades.md: merge extension+gateway into single component upgrade (documentDBVersion upgrades both), move pre-upgrade checklist under component upgrades, simplify overview table, remove cluster health check from operator verify
- backup-and-restore.md: convert on-demand/scheduled to tabs with API refs, fix CSI prerequisite wording, add YAML title, update retention policy to table format, improve backup identification step
- maintenance.md: clarify logLevel scope (PostgreSQL only), remove fake resource allocation table, add PVC resize planned note, clarify cordon terminology
- restore-deleted-cluster.md: fix broken anchor references
- mkdocs.yml: reorder nav (failover before upgrades)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- failover: fix misleading write-only downtime claim to cover both reads and writes, add playground links for cross-cluster failover, explain instancesPerNode >= 2 requirement explicitly, merge behavior sections
- maintenance: add normal/investigate guidance for each maintenance task so users know what to expect and when to troubleshoot
- upgrades: add rollback sections with schema version check guidance (rollback if schema not upgraded, otherwise restore from backup)

All failover doc claims verified against source code and tested in Kind cluster (3-instance cluster, primary deletion triggers automatic failover with data preservation confirmed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Scaling operations (instancesPerNode, pvcSize changes) do not propagate to existing CNPG clusters due to the reconciliation loop gap documented in issue documentdb#306. Moving scaling doc to a separate branch until the operator bug is fixed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Upgrade doc fixes verified against source code and Kind cluster:

- Fix downgrade behavior: operator skips schema migration but still updates images (not 'rejects the change')
- Fix rolling update: primaryUpdateMethod=restart means primary is restarted in place (no switchover)
- Fix health check: operator checks primary pod health, not all pods
- Fix CRD handling: Helm crds/ dir only applies on install, not upgrade
- Remove misleading 'zero-downtime' from description

Maintenance doc cleanup:

- Remove CNPG-internal Advanced Diagnostics section
- Remove troubleshooting section with CNPG-specific commands
- Remove broken link to scaling doc (moved to separate branch)
- Reorganize Routine Checks section placement

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Fix CRD URL from microsoft/ to documentdb/ GitHub org
- List all 3 CRDs (dbs, backups, scheduledbackups) instead of just 1
- Fix image override examples to use correct repo path:
  ghcr.io/documentdb/documentdb-kubernetes-operator/documentdb
  ghcr.io/documentdb/documentdb-kubernetes-operator/gateway

All 24 claims in the upgrade doc verified against source code and local Kind cluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Fix backup status from 'Succeeded' to 'completed' (actual phase value)
- Add missing metadata.name field to on-demand backup YAML example
- Apply same status fix in restore-deleted-cluster doc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

…enance doc The DocumentDB CRD has spec.resource.storage (for PVC config) but no spec.resources.limits for CPU/memory. Replace with generic guidance based on kubectl top output. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

The DocumentDB CRD (dbs.documentdb.io) exceeds the annotation size limit for client-side kubectl apply, causing 'metadata.resourceVersion: Invalid value: 0' errors. Switch to --server-side --force-conflicts which avoids this limitation. Verified in Kind cluster: CRD apply, helm upgrade (test->dev), and helm rollback all tested successfully with zero DocumentDB cluster disruption. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Remove non-existent INSTANCES column from kubectl get documentdb table
- Fix pod label selector from documentdb.io/cluster to app=<cluster-name>
- Fix PG log path from postgresql.log to /controller/log/postgres
- Fix gateway container name from gateway to documentdb-gateway
- Replace non-existent BackupSucceeded event with real BackupSchedule event
- Replace non-existent FailoverCompleted event with real InvalidSchedule event
- Fix PVRetained event name to PVsRetained (plural, matches source code)

All fixes verified against Kind cluster and operator source code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
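
As a rough illustration of how the corrected label selector and gateway container name from this commit fit together, the sketch below lists a cluster's pods and tails the gateway logs. The `run` and `show_logs` helpers and the `--tail=50` value are assumptions made for the example; only the `app=<cluster-name>` selector and the `documentdb-gateway` container name come from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; run it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# show_logs <cluster-name> <namespace>  (hypothetical helper)
show_logs() {
  local cluster="$1" namespace="$2"

  # Pods are selected with app=<cluster-name>, not documentdb.io/cluster.
  run kubectl get pods -n "$namespace" -l "app=$cluster"

  # The gateway sidecar container is named documentdb-gateway, not gateway.
  run kubectl logs -n "$namespace" -l "app=$cluster" \
    -c documentdb-gateway --tail=50
}
```

For example, `show_logs my-documentdb my-namespace` prints both commands without touching the cluster; both names here are placeholders.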

The scaling operations doc was moved to the wentingwu/scaling-docs branch pending resolution of issue documentdb#306. Remove the nav entry to avoid a broken link in the docs build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Update the YAML description field in each operations doc so it accurately summarises the sections in that file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Copilot

Pull request overview

This PR expands the public “Preview” documentation by adding a new Operations section (failover, upgrades, backup/restore, restore-deleted-cluster, maintenance) and updates several existing configuration page descriptions for clarity.

Changes:

Adds new Operations documentation pages under docs/operator-public-documentation/preview/operations/.
Updates mkdocs.yml navigation to surface the new Operations section.
Refines YAML frontmatter description text for networking, TLS, and storage configuration docs.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
|------|-------------|
| mkdocs.yml | Adds “Operations” nav entries to expose new operational guides (but currently also keeps an existing “Backup and Restore” entry at the same level). |
| docs/operator-public-documentation/preview/operations/upgrades.md | New guide describing operator vs component upgrades and rollback considerations. |
| docs/operator-public-documentation/preview/operations/failover.md | New failover guide covering local and multi-region/cross-cluster promotion. |
| docs/operator-public-documentation/preview/operations/backup-and-restore.md | New backup/restore guide using VolumeSnapshots and Backup/ScheduledBackup CRs. |
| docs/operator-public-documentation/preview/operations/restore-deleted-cluster.md | New recovery guide describing restore via Backup or retained PVs. |
| docs/operator-public-documentation/preview/operations/maintenance.md | New maintenance guide covering health checks, logs, resource monitoring, and events. |
| docs/operator-public-documentation/preview/configuration/tls.md | Updates page description to better reflect supported TLS modes and content. |
| docs/operator-public-documentation/preview/configuration/storage.md | Updates page description to remove unsupported “volume expansion” claim. |
| docs/operator-public-documentation/preview/configuration/networking.md | Updates page description to highlight mongosh connection and Service types. |

docs/operator-public-documentation/preview/operations/backup-and-restore.md

docs/operator-public-documentation/preview/operations/failover.md

mkdocs.yml

- Add missing metadata.name to ScheduledBackup example
- Fix GitHub org in failover cross-links (microsoft -> documentdb)
- Remove duplicate top-level 'Backup and Restore' nav entry from mkdocs.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

The content is now covered by the Operations section:

- operations/backup-and-restore.md (backup, restore, retention)
- operations/restore-deleted-cluster.md (PV recovery)

Update cross-references in faq.md and storage.md to point to new paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

hossain-rayhan · 2026-03-18T18:39:47Z

docs/operator-public-documentation/preview/operations/backup-and-restore.md

+List backups for your DocumentDB cluster and choose one in `completed` status:
+
+```bash
+kubectl get backups -n default


Nit: is it in the default namespace?

hossain-rayhan · 2026-03-18T18:48:06Z

docs/operator-public-documentation/preview/operations/upgrades.md

+### Step 4: Upgrade the DocumentDB Operator
+
+```bash
+helm upgrade documentdb-operator documentdb/documentdb-operator \


Should we add `helm upgrade --skip-crds`, since we upgraded the CRDs manually above?

hossain-rayhan · 2026-03-18T18:49:50Z

docs/operator-public-documentation/preview/operations/upgrades.md

+```
+
+### Rollback and Recovery
+


I think for automatic rollback we can utilize helm upgrade my-release my-chart --atomic?
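
A minimal sketch of the suggestion above, applied to the release and chart names from the Step 4 hunk earlier in this thread. The `run` and `upgrade_with_rollback` helpers, the namespace argument, and the `10m` default timeout are assumptions for illustration. `--atomic` rolls the release back if the upgrade fails; flag names can differ between Helm versions, so check `helm upgrade --help` before relying on it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; run it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# upgrade_with_rollback <namespace> [timeout]  (hypothetical helper)
upgrade_with_rollback() {
  local namespace="$1" timeout="${2:-10m}"

  # --atomic waits for the upgrade and rolls back the release on failure;
  # --timeout bounds how long Helm waits before treating it as failed.
  run helm upgrade documentdb-operator documentdb/documentdb-operator \
    --namespace "$namespace" --atomic --timeout "$timeout"
}
```

`upgrade_with_rollback documentdb-operator` prints the command with the default timeout; a second argument such as `5m` overrides it.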

hossain-rayhan · 2026-03-18T18:51:19Z

docs/operator-public-documentation/preview/operations/upgrades.md

+    spec:
+      gatewayImage: "ghcr.io/documentdb/documentdb-kubernetes-operator/gateway:<version>"
+    ```
+


Should we talk about DocumentDB cluster updates? Once the operator or schema updates are done, we want to migrate the cluster to newer versions.

xgerman · 2026-03-18T23:00:16Z

docs/operator-public-documentation/preview/operations/backup-and-restore.md

+
+Backups protect your DocumentDB cluster against data loss from accidental deletion, corruption, or failed upgrades. A reliable backup strategy is the foundation of any production deployment — without it, recovery may be impossible.
+
+The DocumentDB operator provides a snapshot-based backup system built on Kubernetes [VolumeSnapshots](https://kubernetes.io/docs/concepts/storage/volume-snapshots/). Each backup captures a point-in-time copy of the primary instance's persistent volume, which can later be used to bootstrap a new DocumentDB cluster.


"Point-in-time" might not be the best wording, since it sounds like point-in-time restore. At a minimum, explain that data accumulated after a backup and before a crash might be lost.

xgerman · 2026-03-18T23:00:59Z

docs/operator-public-documentation/preview/operations/backup-and-restore.md

+
+## Prerequisites
+
+Before creating backups, ensure your Kubernetes cluster has the required snapshot infrastructure.


s/infrastructure/support/g

xgerman · 2026-03-18T23:02:57Z

docs/operator-public-documentation/preview/operations/failover.md

+## Local Automatic Failover
+
+Local automatic failover requires at least two instances (`spec.instancesPerNode >= 2`). With a single instance, there is only the primary and no replica available to promote — so failover is not possible. When multiple instances are running, the operator automatically promotes a replica to primary if the current primary becomes unavailable.
+


We recommend matching the number of local replicas to the number of availability zones.

xgerman · 2026-03-18T23:04:05Z

docs/operator-public-documentation/preview/operations/failover.md

+
+In a multi-region setup:
+
+- One DocumentDB cluster is designated as the **primary** and handles all writes.


The primary can be set up as an "HA cluster", with replicas providing local HA, so that a failover to another region is only necessary under extraordinary circumstances...

xgerman · 2026-03-18T23:05:31Z

docs/operator-public-documentation/preview/operations/maintenance.md

+
+## Log Management
+
+=== "DocumentDB Operator Logs"


We recommend setting up centralized log collection as part of your observability strategy (see the observability chapter).

xgerman · 2026-03-18T23:06:26Z

docs/operator-public-documentation/preview/operations/maintenance.md

+
+```yaml
+spec:
+  logLevel: "info"  # Options: debug, info, warning, error


Why do we default to info? In prod, shouldn't it run at warn or error?
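
For context, switching an existing cluster to a quieter level could look roughly like the sketch below. It assumes `spec.logLevel` can be changed with a merge patch on the `documentdb` resource; the `run` and `set_log_level` helpers are hypothetical, and the accepted values are the four listed in the hunk above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; run it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# set_log_level <documentdb-name> <namespace> <level>  (hypothetical helper)
set_log_level() {
  local name="$1" namespace="$2" level="$3"

  # Accept only the options listed in the doc's YAML comment.
  case "$level" in
    debug|info|warning|error) ;;
    *) echo "unsupported logLevel: $level" >&2; return 1 ;;
  esac

  # Assumption: a merge patch on spec.logLevel is enough to change the level.
  run kubectl patch documentdb "$name" -n "$namespace" --type merge \
    -p "{\"spec\":{\"logLevel\":\"$level\"}}"
}
```

`set_log_level my-documentdb my-namespace warning` prints the patch command; an unlisted level such as `verbose` is rejected before any command is built.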

xgerman · 2026-03-18T23:07:45Z

docs/operator-public-documentation/preview/operations/upgrades.md

+```
+
+### Step 2: Review Available Versions
+


Note: per release policy (see ...) we only support ...

xgerman · 2026-03-18T23:09:03Z

docs/operator-public-documentation/preview/operations/upgrades.md

+
+| Upgrade Type | What Changes | How to Trigger |
+|-------------|-------------|----------------|
+| **DocumentDB operator** | The Kubernetes operator itself | Helm chart upgrade |


Please also specify that we upgrade CNPG for you. Is there a way to skip that?

xgerman · 2026-03-18T23:11:00Z

docs/operator-public-documentation/preview/operations/upgrades.md

+
+## Component Upgrades
+
+Updating `spec.documentDBVersion` upgrades **both** the DocumentDB extension and the gateway together, since they share the same version.


We should probably explain how we ensure that everything is deployed before we upgrade the schema in multi-region. The statement you wrote is confusing because it implies the schema gets updated automatically, which we don't want in multi-region.

xgerman · 2026-03-18T23:11:39Z

docs/operator-public-documentation/preview/operations/upgrades.md

+1. You update the `spec.documentDBVersion` field.
+2. The operator detects the version change and updates both the database image and the gateway sidecar image.
+3. The underlying cluster manager performs a **rolling restart**: replicas are restarted first one at a time, then the **primary is restarted in place**. Expect a brief period of downtime while the primary pod restarts.
+4. After the primary pod is healthy, the operator runs `ALTER EXTENSION documentdb UPDATE` to update the database schema.


What about multi-region?

backup-and-restore.md:

- Replace 'point-in-time copy' with 'crash-consistent snapshot' and clarify that PITR is not supported (data loss between snapshot and failure)
- s/infrastructure/support/ in prerequisites
- Use <namespace> placeholder instead of hardcoded 'default'

failover.md:

- Add tip: match instancesPerNode to number of availability zones
- Clarify that primary cluster can itself be multi-instance HA, reducing need for cross-region failover

maintenance.md:

- Add centralized log collection recommendation with link to telemetry playground
- Change logLevel example to 'warning' and add production tip

upgrades.md:

- Document that CloudNative-PG is bundled and upgraded automatically
- Add release strategy support window note
- Add --skip-crds to helm upgrade (CRDs are applied manually)
- Add --atomic tip for automatic rollback
- Add cross-link from operator upgrade to component upgrades
- Document multi-region upgrade order (standbys first, primary last)
- Document multi-region schema migration behavior (primary-only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

helm upgrade does not touch CRDs at all (per Helm docs), so --skip-crds is a no-op and misleading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

WentingWu666666 changed the title ~~docs: add operations documentation (failover, maintenance, upgrades, backup, restore)~~ docs: add operations documentation guides Mar 18, 2026

WentingWu666666 marked this pull request as ready for review March 18, 2026 18:16

WentingWu666666 requested review from alaye-ms, hossain-rayhan and xgerman as code owners March 18, 2026 18:16

Copilot AI review requested due to automatic review settings March 18, 2026 18:16

Copilot started reviewing on behalf of WentingWu666666 March 18, 2026 18:17 View session

wentingwu000 and others added 14 commits March 18, 2026 14:20

Copilot AI reviewed Mar 18, 2026

WentingWu666666 force-pushed the wentingwu/issue-253-operations-docs branch from cf17f84 to 39f59a0 Compare March 18, 2026 18:22

wentingwu000 and others added 2 commits March 18, 2026 14:24

hossain-rayhan reviewed Mar 18, 2026

xgerman requested changes Mar 18, 2026

wentingwu000 and others added 2 commits March 19, 2026 10:24

docs: remove unnecessary --skip-crds from helm upgrade

d136d92

5 participants